One drawback of modern transformers is that every token receives the same amount of predictive compute, even though some tokens are far easier to predict than others. This work from DeepMind lets models exit early during generation and spend fewer FLOPs on those easy tokens, effectively opening the door to dynamic compute with a fixed maximum. The result: 50% fewer FLOPs at generation time for equivalent performance.
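To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of one way to get dynamic per-token compute with a fixed maximum: a learned router scores each token, only a fixed fraction (the capacity) passes through a block's compute, and the rest skip it on the residual stream. The `RoutedBlock` class, the `capacity` parameter, and the sigmoid gating are illustrative assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    """Wraps a transformer sub-block so only a fixed fraction of
    tokens receive its compute; the rest pass through unchanged
    via the residual stream. Illustrative sketch, not the paper's
    exact method."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block                 # e.g. attention + MLP returning a residual delta
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity           # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))          # fixed compute budget per sequence
        scores = self.router(x).squeeze(-1)         # (batch, seq_len) router scores
        topk = scores.topk(k, dim=-1).indices       # tokens selected for full compute

        out = x.clone()                             # skipped tokens: identity / early exit
        idx = topk.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(x, 1, idx)          # (batch, k, d_model)
        # Gate by the router score so token selection stays differentiable.
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        out.scatter_(1, idx, selected + gate * self.block(selected))
        return out
```

One caveat this sketch glosses over: top-k over a whole sequence is not causal, so at autoregressive generation time a scheme like this needs some way to decide per token, without seeing the future, whether it falls inside the budget.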
Friday, April 5, 2024